Checkpointing algorithms and fault prediction
Author affiliations: LIP, École Normale Supérieure de Lyon, France; University of Tennessee Knoxville, USA; Institut Universitaire de France; INRIA
Abstract
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results make it possible to assess analytically the key parameters that impact the performance of fault predictors at very large scale.
Keywords: Fault tolerance, checkpointing, prediction, algorithms, model, exascale
French title (translated): Study of the impact of fault prediction on checkpointing protocol strategies
Résumé (translated from French): This work considers the impact of fault prediction techniques on strategies for checkpointing and restart protocols. We extend the classical analysis of Young in the presence of a fault prediction system, which is characterized by its recall (the ratio of predicted faults to the total number of faults) and by its precision (the ratio of actual faults to the total number of announced faults). We obtain the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to the cost of taking these checkpoints) in different scenarios. This paper lays the theoretical foundations for future experiments and a validation of the model.
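As a rough illustration of the quantities named in the abstract, the sketch below computes a predictor's recall and precision from raw counts, together with Young's classical first-order checkpointing period, sqrt(2 × MTBF × checkpoint cost), which is the baseline the paper extends. The function names, parameter names, and sample numbers are assumptions made for illustration. The paper's own prediction-aware period is not stated on this page and is not reproduced here.

```python
import math


def recall(predicted_faults: int, total_faults: int) -> float:
    """Fraction of all faults that the predictor announced in advance."""
    if total_faults <= 0:
        raise ValueError("total_faults must be positive")
    return predicted_faults / total_faults


def precision(true_predictions: int, total_predictions: int) -> float:
    """Fraction of announced faults that turned out to be actual faults."""
    if total_predictions <= 0:
        raise ValueError("total_predictions must be positive")
    return true_predictions / total_predictions


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """Young's first-order period: sqrt(2 * MTBF * C).

    Both arguments must use the same time unit (hypothetical inputs,
    not values taken from the paper).
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


if __name__ == "__main__":
    # Hypothetical platform: one fault per day, 10-minute checkpoints.
    period = young_period(mtbf=86_400.0, checkpoint_cost=600.0)
    print(f"Young period: {period:.0f} s")
    # Hypothetical predictor: 80 of 100 faults predicted, 120 alarms raised.
    print(f"recall = {recall(80, 100):.2f}, precision = {precision(80, 120):.2f}")
```

With these sample inputs the period is 7200·√2 seconds, a little under three hours. The formula is only a first-order approximation and assumes the checkpoint cost is small compared with the MTBF.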
Related resources
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs, and as a result the workload of Mobile ...
Scheduling for fault-tolerance: an introduction
In this chapter, we present scheduling algorithms to cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. Then we explain how to combine checkpointing with fault prediction, and discuss how the optimal period is modified when this combination is used. And finally we follow the very same approach for the combination o...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
Impact of fault prediction on checkpointing strategies
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the wa...
Metapromela: A Toolkit for Simulation of Checkpointing Algorithms
Distributed checkpointing algorithms play an important role in the majority of the fault tolerant software components existent today. Unfortunately, there is a lack of comprehensive and uniform performance testing of those algorithms. Our research focuses on the provision of a toolkit, Metapromela, that helps with the implementation and testing of distributed checkpointing algorithms. This pape...
Journal: J. Parallel Distrib. Comput.
Volume: 74
Publication date: 2014